Analysis

14 December 2020


Analysis through Regressions: Methodology, Selection and Justification

To study further the impact of the different factors on criminality, we use econometric regressions.

We start with a standard OLS model in which the dependent variable is total criminality per 1000 people, defined as the sum of rape, homicide, violent crime and aggravated assault, each expressed per 1000 inhabitants. The OLS model is: \[ \begin{align*} Criminality \, per \, 1000 \, inhabitants&=\alpha+\beta_1log(GDP)+\beta_2log(mh\_exp\_pc)+\beta_3perc\_bscholder\_25\_44+\\ & \,\,+\beta_4 White+\beta_5 BlackAfricanAmerican+\beta_6Asian + \\ & \,\, + \beta_7Age\_0\_17 + \beta_8Age\_18\_24+ \beta_9Age\_25\_44+ \\ & \,\,+\beta_{10}Age\_45\_64+ \beta_{11}Age\_65\_84+\beta_{12}log(population) \end{align*} \] Running the regression, we obtain the following coefficient estimates:

Standard OLS, dependent variable: Total Criminality per 1000 inhabitants
  total_criminality
Predictors Estimates CI p
(Intercept) -146.90 -254.92 – -38.88 0.008
Current_dollar_GDP_millions [log] 4.73 3.52 – 5.94 <0.001
mh_exp_pc [log] -0.30 -0.73 – 0.12 0.162
perc_bscholder_25_44 -0.09 -0.15 – -0.03 0.002
White -30.76 -36.96 – -24.57 <0.001
BlackAfricanAmerican -19.52 -25.57 – -13.48 <0.001
Asian -58.17 -69.43 – -46.90 <0.001
Age_0_17 158.71 50.56 – 266.85 0.004
Age_18_24 197.17 78.31 – 316.04 0.001
Age_25_44 241.96 142.86 – 341.06 <0.001
Age_45_64 161.16 55.17 – 267.14 0.003
Age_65_84 255.53 128.73 – 382.32 <0.001
population [log] -4.20 -5.41 – -2.98 <0.001
Observations 505
R2 / R2 adjusted 0.648 / 0.639

GDP, mh_exp_pc and the education proxy have the signs we expected from the earlier EDA: GDP is associated with higher criminality, while mental health expenditure and education are associated with lower criminality. Of the three, however, only \(log(GDP)\) and education are statistically significant. Surprisingly, all race variables have a negative coefficient; this is not a convincing result, since the correlation between criminality and the Black or African American share appeared positive in the corrplot in the EDA section. The table also shows that all age groups, measured as percentages of the population, are significant. However, since all their coefficients are positive, we suspect some misspecification leading to biased estimators. In general, we do not think this regression is informative, because it does not account for characteristics specific to each state and year: a standard OLS ignores the fact that our data set is a panel.

Therefore, we declared our data frame as panel data and estimated regressions with fixed effects, random effects and first differences. Before proceeding, we briefly explain each of them:

  • Fixed Effect: the “within” method controls for variables that remain constant over time. In our case, whatever effect being a particular US state has on criminality is treated as the same in every year.
  • Random Effect: this takes the opposite view to the one above. The unit-specific effects are treated as random, that is, as unpredictable draws rather than fixed constants to be removed.
  • First Difference: this method also deals with the omitted variable problem in panel data, and it is consistent under the same assumptions as the fixed effect method. Like the fixed effect method, it accounts for effects that are constant over time; indeed, with T=2 the two methods should give the same result.
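The claim that the within and first-difference estimators coincide when T=2 can be checked on a toy panel. The sketch below is only an illustration: the three units, their values and the function names are made up, and a single regressor is used for brevity.

```python
# Toy panel: 3 hypothetical units observed in T=2 periods (made-up numbers).
# Each unit has its own constant level, which both estimators remove.
panel = {
    "A": {"x": [1.0, 2.0], "y": [5.0, 7.5]},
    "B": {"x": [4.0, 6.0], "y": [1.0, 4.0]},
    "C": {"x": [2.0, 5.0], "y": [9.0, 14.0]},
}

def slope_through_origin(xs, ys):
    """OLS slope without intercept: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def within_slope(data):
    """Fixed-effects ("within") estimator: demean x and y by unit, then regress."""
    xs, ys = [], []
    for unit in data.values():
        x_bar = sum(unit["x"]) / len(unit["x"])
        y_bar = sum(unit["y"]) / len(unit["y"])
        xs += [x - x_bar for x in unit["x"]]
        ys += [y - y_bar for y in unit["y"]]
    return slope_through_origin(xs, ys)

def first_difference_slope(data):
    """First-difference estimator: regress (y_t - y_{t-1}) on (x_t - x_{t-1})."""
    dxs, dys = [], []
    for unit in data.values():
        dxs += [b - a for a, b in zip(unit["x"], unit["x"][1:])]
        dys += [b - a for a, b in zip(unit["y"], unit["y"][1:])]
    return slope_through_origin(dxs, dys)

fe = within_slope(panel)
fd = first_difference_slope(panel)
print(f"within: {fe:.4f}, first difference: {fd:.4f}")
```

With more than two periods the two estimators generally differ, which is why the choice between them is discussed separately.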

We ran all three regressions, but after some consideration we think the most appropriate for our case is the fixed effect method, for the following reasons:

  • A Hausman test between the fixed effect and the random effect regressions leads us to select the former. The random effect method relies on additional strong assumptions compared with the fixed effect method (notably, that the unobserved heterogeneity is uncorrelated with the independent variables). If these hold, the random effect estimates are more efficient and should be preferred; if they do not hold, the random effect estimates are wrong. We ran the Hausman test for many regressions (with each crime separately, total crimes, and total crimes minus rape as dependent variable), and in every case it favors the fixed effect regression. For example, with total criminality as dependent variable we obtain a low p-value, which means we can reject, at the 1% significance level, the hypothesis that the two regressions give the same results. Thus, the random effect method would give us biased results. (Note that a random effect regression can also include control variables that are constant over time; indeed, we also include region when trying the RE regression.)
  • Both fixed effect and first difference remove time-invariant unobserved variables, which lets us deal with possible omitted variables that are constant over time. First differencing removes the unobserved heterogeneity by subtracting the lagged observation rather than the group mean used by fixed effect. First differencing is usually suggested when the number of units N is small and the observations span a long time frame (i.e. T is large). In our case, however, we only have T=10, while we have 52 different units if we include the United States total, 51 otherwise. For this reason we decide to use Fixed Effect.
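The logic of the Hausman comparison can be sketched for a single coefficient. This is a simplified scalar version, not the full test (which uses the whole coefficient vector and covariance matrices). The estimates and standard errors below are hypothetical and are not taken from our tables.

```python
import math

def hausman_single(b_fe, se_fe, b_re, se_re):
    """Hausman statistic and p-value for one coefficient.

    H = (b_fe - b_re)^2 / (Var_fe - Var_re), compared with a chi-square
    distribution with 1 degree of freedom.
    """
    var_diff = se_fe ** 2 - se_re ** 2
    if var_diff <= 0:
        raise ValueError("Var(FE) - Var(RE) must be positive for this scalar version")
    h = (b_fe - b_re) ** 2 / var_diff
    # Chi-square(1) survival function: P(Z^2 > h) = erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value

# Hypothetical estimates, NOT taken from the tables in this report.
h_stat, p_val = hausman_single(b_fe=3.9, se_fe=0.6, b_re=2.0, se_re=0.4)
print(f"H = {h_stat:.2f}, p = {p_val:.6f}")
if p_val < 0.01:
    print("Reject equality of FE and RE estimates at 1%: prefer fixed effects")
```

A small p-value means the two sets of estimates differ by more than sampling noise would explain, which is the case in which the random effect assumptions are doubted.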

An additional consideration is whether to use clustered standard errors. Their advantage is that they account for within-cluster correlation or heteroskedasticity, which the fixed-effects estimator alone does not. Note that cluster adjustment changes the estimated standard errors but leaves the point estimates unchanged. In our case, though, the results do not change in any relevant way whether or not we use cluster-adjusted standard errors.
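To illustrate that clustering changes only the standard error, here is a minimal sketch with one regressor and no intercept (as after demeaning). The data are made up, the clusters stand in for states, and no small-sample correction is applied to the cluster-robust variance; statistical packages typically apply one, so their numbers would differ slightly.

```python
import math

# Made-up data: (cluster id, x, y). Clusters stand in for states.
rows = [
    ("s1", -1.0, -2.1), ("s1", 0.0, 0.4), ("s1", 1.0, 2.6),
    ("s2", -1.0, -1.2), ("s2", 0.0, -0.5), ("s2", 1.0, 0.8),
    ("s3", -1.0, -2.9), ("s3", 0.0, 0.3), ("s3", 1.0, 3.5),
]

def fit_with_both_standard_errors(data):
    """Return (slope, classical SE, cluster-robust SE) for y = beta * x + e."""
    sxx = sum(x * x for _, x, _ in data)
    beta = sum(x * y for _, x, y in data) / sxx
    residuals = [(g, x, y - beta * x) for g, x, y in data]

    # Classical SE: assumes independent errors with a common variance.
    n = len(data)
    s2 = sum(e * e for _, _, e in residuals) / (n - 1)
    se_classical = math.sqrt(s2 / sxx)

    # Cluster-robust SE: sum the scores x*e within each cluster first,
    # so that correlation inside a cluster is allowed for.
    scores = {}
    for g, x, e in residuals:
        scores[g] = scores.get(g, 0.0) + x * e
    meat = sum(s * s for s in scores.values())
    se_cluster = math.sqrt(meat) / sxx

    return beta, se_classical, se_cluster

beta, se_classical, se_cluster = fit_with_both_standard_errors(rows)
print(f"beta = {beta:.4f}")
print(f"classical SE = {se_classical:.4f}, cluster-robust SE = {se_cluster:.4f}")
```

The slope is computed once and shared by both variants; only the variance formula differs.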

We would also like to point out another consideration that arose while running the regressions. In the EDA part we saw that Rape seems to be the only kind of crime, among those we consider, that behaves differently and is influenced differently by GDP and, slightly, by the other variables. For this reason we ran separate regressions with the following dependent variables (in per 1000 terms):

  • a group of all crimes but rape
  • all crimes
  • each single crime on its own

In all the regressions we exclude the United States aggregate, since it would be redundant: it is the total of the other states.
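The construction of these dependent variables can be sketched as follows. The records, field names and counts are hypothetical and only illustrate the steps: convert counts to per 1000 terms, build the two aggregates, and drop the national total.

```python
# Hypothetical records: field names and counts are made up for illustration.
records = [
    {"state": "State A", "population": 2_000_000, "rape": 800,
     "homicide": 100, "violent_crime": 8_000, "aggravated_assault": 5_000},
    {"state": "State B", "population": 500_000, "rape": 150,
     "homicide": 20, "violent_crime": 1_500, "aggravated_assault": 900},
    {"state": "United States", "population": 2_500_000, "rape": 950,
     "homicide": 120, "violent_crime": 9_500, "aggravated_assault": 5_900},
]

CRIMES = ("rape", "homicide", "violent_crime", "aggravated_assault")

def per_1000(count, population):
    """Express a raw count per 1000 inhabitants."""
    return 1000.0 * count / population

def build_outcomes(record):
    """Per-1000 rate for each crime plus the two aggregate outcomes."""
    rates = {crime: per_1000(record[crime], record["population"]) for crime in CRIMES}
    rates["all_crimes"] = sum(rates[crime] for crime in CRIMES)
    rates["all_but_rape"] = rates["all_crimes"] - rates["rape"]
    return rates

# Drop the national aggregate, which is just the total of the states.
state_records = [r for r in records if r["state"] != "United States"]
outcomes = {r["state"]: build_outcomes(r) for r in state_records}

for state, rates in outcomes.items():
    print(state, {name: round(value, 3) for name, value in rates.items()})
```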

Answers to the research questions

We report here the results that we consider worth mentioning. As said above, we select the fixed effect method. The model is: \[ \begin{align*} Y_{i,t} &=\alpha_i+\beta_1log(GDP)+\beta_2log(mh\_exp\_pc)+\beta_3perc\_bscholder\_25\_44+\\ & \,\,+\beta_4 White+\beta_5 BlackAfricanAmerican+\beta_6Asian + \\ & \,\, + \beta_7Age\_0\_17 + \beta_8Age\_18\_24+ \beta_9Age\_25\_44+ \\ & \,\,+\beta_{10}Age\_45\_64+ \beta_{11}Age\_65\_84+\beta_{12}log(population) \end{align*} \] \(Y_{i,t}\) refers to the dependent variable for state \(i\) at time \(t\). The estimation is done on \(Y_{i,t}-\bar{Y_i}\), where \(\bar{Y_i}\) is the mean of the dependent variable for state \(i\), with the regressors demeaned in the same way. The state-specific intercept \(\alpha_i\) therefore does not appear in the results, as it is constant over time.
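A compact way to see why the time-constant term drops out is to average the model over time within each state and subtract. The notation here is assumed for this sketch and is not part of the estimation output: \(\alpha_i\) is the state-specific constant term, \(x_{i,t}\) is the vector of regressors, and \(\varepsilon_{i,t}\) is an error term.

\[ \begin{align*} Y_{i,t} &= \alpha_i + x_{i,t}'\beta + \varepsilon_{i,t} \\ \bar{Y_i} &= \alpha_i + \bar{x}_i'\beta + \bar{\varepsilon}_i \\ Y_{i,t}-\bar{Y_i} &= (x_{i,t}-\bar{x}_i)'\beta + (\varepsilon_{i,t}-\bar{\varepsilon}_i) \end{align*} \]

The last line contains no constant term, so any regressor that does not vary over time within a state is removed together with it.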

For total criminality, the regression results are:

Fixed Effect, dependent variable: Total Criminality per 1000 inhabitants
  total_criminality
Predictors Estimates CI p
Current_dollar_GDP_millions [log] 3.88 2.74 – 5.01 <0.001
mh_exp_pc [log] 0.16 -0.20 – 0.51 0.392
perc_bscholder_25_44 -0.05 -0.13 – 0.02 0.149
White -33.85 -81.35 – 13.64 0.163
BlackAfricanAmerican -3.33 -53.89 – 47.23 0.897
Asian -75.27 -131.71 – -18.83 0.009
Age_0_17 268.59 87.68 – 449.50 0.004
Age_18_24 242.92 42.55 – 443.30 0.018
Age_25_44 260.26 69.73 – 450.78 0.008
Age_45_64 267.36 70.77 – 463.94 0.008
Age_65_84 246.68 50.23 – 443.12 0.014
population [log] -15.85 -20.10 – -11.59 <0.001
Observations 505
R2 / R2 adjusted 0.471 / 0.397

The \(R^2\), a statistical measure of the proportion of the variance of the dependent variable that is explained by the independent variables in a regression model, is lower here than in the standard OLS. Compared with the standard OLS estimates, the magnitudes change but the signs do not. The only exception is mental health expenditure, which here appears to have a positive effect on criminality. However, mh_exp_pc and the education proxy are no longer statistically significant. Additionally, among races, only the percentage of Asian population is statistically significant, and it still has a negative influence on criminality. As in the OLS estimates, \(log(population)\) decreases criminality: because the regressor is in logs and the dependent variable is in levels, a 1% increase in population is associated with a decrease of about 0.16 crimes per 1000 inhabitants (the coefficient of -15.85 divided by 100).
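Coefficients on logged regressors are read on a level-log scale: with the dependent variable in levels, a p% change in x moves y by \(\beta \cdot ln(1+p/100)\). The sketch below applies this to the two FE coefficients reported in the table above. It assumes that “[log]” denotes the natural logarithm (the table does not state the base), and the function name is illustrative.

```python
import math

def level_log_effect(beta, pct_change):
    """Change in y implied by a term beta * ln(x) when x rises by pct_change percent."""
    return beta * math.log(1.0 + pct_change / 100.0)

# FE coefficients for total criminality per 1000 inhabitants.
coefficients = {"log(population)": -15.85, "log(GDP)": 3.88}

for name, beta in coefficients.items():
    one_percent = level_log_effect(beta, 1.0)
    doubling = level_log_effect(beta, 100.0)
    print(f"{name}: +1% -> {one_percent:+.3f}, doubling -> {doubling:+.2f} crimes per 1000")
```

For small changes the effect is close to the coefficient divided by 100 per 1% change in the regressor.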

Among the various regressions we ran, only those with rape and homicide as dependent variables give results that differ from the one presented above.

For Rape:

Fixed Effect, dependent variable: Rapes per 1000 inhabitants
  rape_legacy
Predictors Estimates CI p
Current_dollar_GDP_millions [log] 0.0762 0.0087 – 0.1438 0.027
mh_exp_pc [log] 0.0131 -0.0081 – 0.0343 0.226
perc_bscholder_25_44 -0.0001 -0.0044 – 0.0043 0.976
White 1.2084 -1.6239 – 4.0407 0.403
BlackAfricanAmerican 1.5597 -1.4553 – 4.5746 0.311
Asian -0.2739 -3.6396 – 3.0918 0.873
Age_0_17 2.8333 -7.9549 – 13.6216 0.607
Age_18_24 3.3037 -8.6452 – 15.2527 0.588
Age_25_44 5.9642 -5.3975 – 17.3260 0.304
Age_45_64 4.9356 -6.7876 – 16.6588 0.410
Age_65_84 3.1876 -8.5273 – 14.9025 0.594
population [log] -0.3180 -0.5715 – -0.0644 0.014
Observations 505
R2 / R2 adjusted 0.216 / 0.106

From the FE regression with Rape per 1000 inhabitants as dependent variable we learn that:

  • Rape appears, in magnitude, less affected by GDP and population than total criminality, but the effects are still positive and negative respectively,
  • \(R^2\) is very low, so the regression probably does not explain much of the variance in Rape in a state in a given year,
  • Only \(log(GDP)\) and \(log(population)\) are statistically significant; the other variables are not.
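The gap between \(R^2\) and adjusted \(R^2\) in these tables can be approximated with the usual degrees-of-freedom correction. The sketch assumes (our assumption, not something the output states) that the OLS regression estimates 12 slopes plus an intercept, and that each FE regression estimates 12 slopes plus 51 state effects. Inputs are the rounded values from the tables, so results should only match up to rounding.

```python
def adjusted_r2(r2, n_obs, n_params):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - n_params).

    n_params counts every estimated parameter, including the intercept
    or the state effects absorbed by the within transformation.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_params)

N_OBS = 505
OLS_PARAMS = 12 + 1   # 12 slopes + intercept (assumed)
FE_PARAMS = 12 + 51   # 12 slopes + 51 state effects (assumed)

# (label, reported R^2, reported adjusted R^2, assumed parameter count)
reported = [
    ("OLS total criminality", 0.648, 0.639, OLS_PARAMS),
    ("FE total criminality", 0.471, 0.397, FE_PARAMS),
    ("FE rape", 0.216, 0.106, FE_PARAMS),
    ("FE homicide", 0.683, 0.638, FE_PARAMS),
]

for label, r2, adj_reported, n_params in reported:
    adj = adjusted_r2(r2, N_OBS, n_params)
    print(f"{label}: computed {adj:.3f}, reported {adj_reported:.3f}")
```

The correction is much larger for the FE regressions because the state effects use up degrees of freedom, which is why a low \(R^2\) such as the one for Rape shrinks further once adjusted.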

For Homicides:

Fixed Effect, dependent variable: Homicides per 1000 inhabitants
  homicide
Predictors Estimates CI p
Current_dollar_GDP_millions [log] 0.051 0.037 – 0.065 <0.001
mh_exp_pc [log] 0.004 -0.001 – 0.008 0.084
perc_bscholder_25_44 -0.001 -0.002 – -0.000 0.004
White -0.061 -0.661 – 0.539 0.842
BlackAfricanAmerican 1.212 0.574 – 1.851 <0.001
Asian -0.820 -1.533 – -0.107 0.025
Age_0_17 2.520 0.235 – 4.805 0.031
Age_18_24 1.483 -1.048 – 4.014 0.251
Age_25_44 0.931 -1.475 – 3.338 0.448
Age_45_64 1.394 -1.088 – 3.877 0.272
Age_65_84 2.087 -0.394 – 4.568 0.100
population [log] -0.209 -0.262 – -0.155 <0.001
Observations 505
R2 / R2 adjusted 0.683 / 0.638

From the FE regression with Homicides per 1000 inhabitants as dependent variable we learn that:

  • Homicide appears, in magnitude, less affected by GDP and population than total criminality, but the effects are still positive and negative respectively,
  • \(R^2=0.683\), so the variables in the regression explain 68% of the variance in homicides per 1000 people in a state in a certain year,
  • The Black or African American and Asian percentages of the population have, respectively, a statistically significant positive and negative effect on homicides per 1000 inhabitants.
  • Having a high percentage of very young population appears to increase homicides at the 5% significance level, but we find this difficult to explain through social mechanisms in a community.

Is there any relationship between expenditure for mental health by the government and criminality?

The answer is inconclusive. In the corrplot, mental health expenditure shows slightly positive correlations with crimes (the only exception being Rape), but in the regressions it is not statistically significant. On the other hand, the relationship between mental health expenditure and crimes appears negative in the scatterplot and the time series shown in earlier sections.

Is the level of education and wealth (through GDP) of a State relevant for its level of criminality?

For GDP we can say that:

  • Its relationship with criminality is consistent throughout our analysis: the outcome of our study is a positive effect of GDP on criminality.
  • For Rape the answer is less clear. From the regression we learn that the impact is positive but much smaller than for total criminality. From the following scatterplot, instead, we would say that as GDP increases, Rape decreases.
  • In the Fixed Effect (FE) estimation for total criminality, the coefficient can be interpreted as follows: if GDP in millions of dollars increases by 1%, the number of crimes in a state in a given year increases by about 0.039 per 1000 inhabitants (the coefficient of 3.88 divided by 100, since GDP enters in logs). This is statistically significant.

For Education we can say that:

  • Its relationship with criminality is negative: the higher the percentage of the population aged 25 to 44 holding a Bachelor’s Degree, the lower the number of crimes per 1000 people. This was confirmed by scatterplots, corrplot, time series plot and regressions.
  • The FE regressions report it to be significant only for homicides, and there the magnitude of the effect is very small.
  • The OLS estimate implies that an increase of 1 percentage point in the percentage of the population aged 25 to 44 holding a Bachelor’s Degree would decrease total criminality by 0.09 per 1000 inhabitants. This does not seem a large number.
  • Looking at the barplot presented in the univariate visualization section, the North-East and Mid-West regions have the highest percentage of educated population and a lower incidence of crimes than the South and West.

Is the composition of the population, in terms of both age and ethnicity, relevant for criminality in the area?

  • The age composition of the population does not vary significantly across states and regions, so our study cannot say much about it. The only indication regarding the age distribution comes from the corrplot: a younger population (18-44) is associated with more homicides, aggravated assaults and violent crimes, while an older population (45+) appears negatively related to crimes. The regression outputs, however, are inconclusive, since the estimates are all positive and large in magnitude.

  • The race composition of the population could play a role. The South region of the US has the highest percentage of Black or African American population and the highest incidence of crimes, which supports the positive correlation found in the corrplot between all kinds of crimes and the Black or African American share. The White population share is positively correlated with rape. In the regressions, however, the coefficients for all races are negative when looking at total criminality. For homicides, the significant race estimates are those for the Black or African American share (a one-unit increase in this variable, in the units in which it is coded, is associated with about 1.2 more homicides per 1000 inhabitants) and for the Asian share (a one-unit increase is associated with about 0.8 fewer homicides per 1000 inhabitants).

Is mental health expenditure affected by how much the population is educated or by GDP of the country?

Looking at the correlations and the time series reported in the previous section, we would answer yes. There is a positive relationship between the two variables: the more educated the population, the higher the expenditure on mental health in the state. These findings can also be seen in the following scatterplot with the fitted linear regression.